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Abstract 



A parallel algorithm for solving a series of matrix equations with a constant tridiagonal matrix and different 
right-hand sides is proposed and studied. The process of solving the problem is represented in two steps. 
C I The first preliminary step is fixing some rows of the inverse matrix of SLAEs. The second step consists in 
I ^ ■ calculating solutions for all right-hand sides. For reducing the communication interactions, based on the 
I formulated and proved main parallel sweep theorem, we propose an original algorithm for calculating share 
. components of the solution vector. Theoretical estimates validating the efffciency of the approach for both 
the common- and distributed-memory supercomputers are obtained. Direct and iterative methods of solving 
""J^. a 2D Poisson equation, which include procedures of tridiagonal matrix inversion, are realized using the mpi 
' technology. Results of computational experiments on a multicomputer demonstrate a high efffciency and 
i scalability of the parallel sweep algorithm. 

' Key words: Parallel algorithm, Tridiagonal matrix algorithm (TDMA), Thomas algorithm. Sweep 
, Method, Poisson equation. Alternating Direction Method, Fourier Method 
^ ; PACS: 02.60.Dc, 02.60.Cb, 02.70.Bf, 02.70.Hm 
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Q\ ' 1. Introduction. 
^ ■ 

OO ' The progress in numerical methods of solving " complex problems" is impossible without applying pow- 

' erful parallel computer systems. Thus, it is necessary to investigate numerical algorithms that allow for 

■ efficient parallel implementation. 

' The problem of solving tridiagonal systems of linear algebraic equations [J, [3, I3| is one of the most 

0^ . frequently solved problems in computational mathematics. The tridiagonal SLAEs arise in three-point 

' approximation of problems for ordinary differential equations of second order with constant and variable 

. coefficients, and also in realization of difference schemes for equations in partial derivatives 0, @, @|- As a 

' rule, tridiagonal SLAEs are solved using various versions of the direct difference equation method, that is, 

'jjj . a sweep method: monotonic, nonmonotonic, flux and orthogonal [ij, [d. Ill, [sl, l9| . 

' Development and irnprovement of parallel sweep algorithms is of great interest, which is confirmed by 
numerous publications 0, [13, El [11 [ll, Q El 

concerned with this difficult problem. Analyzing papers 



dealing with this topic, we can conclude that presently available parallel sweep algorithms are insufficiently 
efficient and, what is more important, they are insufficiently scalable. The primary cause is that efficient, 
in a theoretical aspect, parallel algorithms realized on different multiprocessor computer systems become 
disadvantageous due to the presence of such operations as communications and synchronizations. 

Solving problems by finite-difference methods frequently requires to solve not one, but a series of tridi- 
agonal SLAEs with different right-hand sides, the number of problems in the series can reach thousands. 
Thus, the problem of designing an efficient parallel sweep algorithm for solving series of tridiagonal systems 
of equations deserves consideration. 
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In this paper, we propose a new approach to designing a paraUel sweep algorithm for solving a series 
of tridiagonal SLAEs with a constant matrix and different right-hand sides. The process of solving the 
problem is subdivided into two steps. The first, preliminary step consists in fixing some rows of the SLAE 
inverse matrix by means of a sequential procedure. Then follows calculation of solutions for all right-hand 
sides; doing so, for increasing the algorithm efficiency using the formulated and proved main parallel sweep 
theorem, we proposed an original algorithm for calculating individual components from the solution vector. 

2. Statement of the problem. 

The series of systems of algebraic linear equations with a symmetrical constant tridiagonal matrix means 

AX„=F„, n=I,...,iV. (1) 
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, where N is the number of problems in the series. 

Assuming that system ([1]) is nondegenerate, we aim for designing a parallel algorithm for solving the 
problem and subsequent realizing on a multicomputer 1^, 17 1. 

Data decomposition. The computational and communication complexity of a parallel algorithm, hence, 
the execution time depend drastically on the way of decomposition of problem data. Let us dwell on the 
problem of mapping the data of problem ^ onto a set of processor elements (PEs). 

Designing algorithms for distributed-memory supercomputers, it is necessary to take into account the 
fact that local data (data in the local memory of the same PE) are accessed much faster than data on a 
distant PE. Thus, even during designing the parallel sweep algorithm, we perform computing such as to 
minimize communication interactions by means of increasing the portion of local calculations (calculations 
performed with local data). 

We ground the proposed parallel sweep algorithm on the following specification of data distribution 
between PEs: 

1. Assuming that the number of PEs is p, divide vectors Fn and Xn into subvectors Qn;, Um as foUowfi 

(2) 



Fn — (Qni, Qn2J Qnp) — (^/" i /2\ ■ ■ • : /Iize{F„} -1 > /sLe{F„} 

X„ = (U„^,U„2, ...,Unp) = (^j^jXj, ...,a;"j^g|x„}-i'^«ze{x„}) • (3) 

2. The sizes of Q„. and U„. are chosen under the conditions 

size{Q„.} = s«ze{U„.} > 2 i = 1, 

Ya=i ■s*2e{Q„.} = Yfi=i size{\Jn,} = size{Fn} = size{X„} 

3. Demand that the pair of subvectors (Qni, UnJ belong to PE number i . 

4. The row of A number j is on the same PE as the pair of elements (a;",/") from Q,®. 

We should note that the specification of decomposition of Xn, Fn, and A rules out absolutely duphcation 
of the problem data. Exactly for this distribution we will design the parallel sweep algorithm for solving of 
problem ([T|). 



^The number of elements of a vector V, is denoted as size size{V}. 
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3. Parallel sweep algorithm 

3.1. Basic algorithm 

Lemma 1. Let the tridiagonal system of linear equations ylX„ = Fn be divided into subsystems of the form 

^iU„;=Qn., i = l,...,p, (4) 



1 

O-li+2 bi.+3 



a-ii+ti-i 
1 



ti = sizejUnJ, Ij ^y^Ji 



k=l 

according to the proposed approach, and let we know the values of elements^ /irsijUnj} ZastjUnj}, 
then systems ^ can be solved independently, equality l^j) will be fulfilled. 

The proof of the lemma follows from the tridiagonal matrix of the structure itself. 
Algorithm 1. Based on Lemma 1, we can propose the following parallel sweep algorithm 

1. Decompose initial system ([T]) into subsystems of form 

2. Find solutions in boundary elements f ir st{\5 -a^} , ZastjUnj}. 

3. Compute Xn , by solving independently subsystems. 

Thus, the parallel sweep algorithm enables to reduce the solution of problem ([T]) to p independent 
subproblems of form (jj]), if values of /i7'st{U„j} and last{\]-a^^\ are known. However, we have still to solve 
the issue of the efficient way of computing "boundary" elements, i.e., it is necessary to design a parallel 
algorithm for computing an individual component of the solution vector Xn. 

3.2. Computing arbitrary solution component. 

Lemma 2. Let B be the symmetrical tridiagonal matrix. Then the value of the kth component of solution 
vector (Y)j, of the equation B~Y = F can be found as 



(Y), = Gk^F, 

where the vector Gk is the solution of the following system of equations 

BGk = Ck, 

ek is the unit vector. 

Proof. Let B.k be designated by the fcth column of B, and Bk. by the fcth row, respectively 



(5) 
(6) 



B — (B.i, B.2, B.n) — 



Bi. 

B2. 

B„ 



^The first and last elements of some vector V, arc denoted as /jrst{V} and iast{V}. 
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By virtue of the definition of the inverse matrix BB^^ — I, the solution of system © is the fcth column of 
B^^ and the fcth row (under the condition of matrix symmetry). 
From this follows 

Gk = S-lek = B7,l = (B-l)^ (Y),^B-iF = GTF. (7) 

Thus, the lemma has been proved. 

Let us represent the algorithm for computing the arbitrary solution component for the series of tridiagonal 
equations of form 

Algorithm 2. For calculating M different components of the solution vector (Xn)^, , n ~ 1,...,N; 
m = 1, A/ from series of equations (H]), : 

1. Find Gk„, m = 1, M, as the solution of ^Gk„ = Gk^. 

2. Define (Xn)j. for the whole series as 

(X„)fc^ = Gl^Fn, n = 1, ...,N, m = 1, M. 

Thus, Algorithm 2 makes it possible to find a separate component of the solution vector. It is important 
that for different km the values of (Xn)^ can be calculated independently. We should note that the vector 
Gk„ does not depend on the right-hand side of H]) , hence, it may be determined once for all Fn- 

Let us consider Algorithm 2 in application to the parallel sweep algorithm. According to Algorithm 1, 
at the second step it is required to calculate values of elements first{XJni}, Zast{U„j} for all PEs. The 
specified data decomposition assumes that only one subvector XJ^ and one subvector Qn. are placed on a 
single PE, therefore, each PE will compute two elements {first{Uni},last{\Jni}) from the solution vector. 
Parallel realization of Algorithm 2 entails a great difficulty, namely, according to (O , each PE will have to 
perform about 0(size{X„}) operations regardless of the number of involved computational resources. 

This causes the problem of modifying Algorithm 2 in such a manner that the number of operations per 
PE is about 0(size{U„.}). 



3.3. The main parallel sweep theorem. 

We will start designing an efficient algorithm for computing "boundary" elements; for illustration, let us 
consider the following boundary-value problem 

— = ip{xo)^0, if{xi)^0 (8) 

As is known fl 8] , the solution of problem ([8|) may be represented in the integral form via the corresponding 
Green function rj 

(fix) = / G{x,s)p{s}ds. (9) 



Let us partition the interval {xo,xi) by three points {2:1/4,11/2,3:3/4} and define the right-hand side of 
dH]) as 

{0, Xq < X < Xi/4^ 

k{x), Xi/4 <X< Xi/2 , p. 

0, Xi/2 <x< 0:3/4 ^ ' 

0, 2:3/4 <x<xi 

According to the solution at the points of partitioning may be defined as 



In this case, we restrict ourselves only to the fact of its existence. 
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Figure 1: 
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G(xi/4, s)fc(s)ds, ip(xxj2)^ \ G(a;i/2,s)fc(s)ds, 

•^a;i/4 



2=1/4 

:i;i/2 



G(x3/4, s)fc(s)ds. 



(11) 



Another way for finding the solution of equation ([8|) at the point 0:3/4 without calculating the integral of 
form ([H]), is as follows: since the point 2:3/4 belongs to the interval (xxji^xx) he solution at it should satisfy 
the equation 



d^ 
dcc^ 



0, 



with the boundary conditions 

<p(a;i/2) 



G(a:i/2, s)fc(s)ds, (^(a:i) = 0. 



(12) 



(13) 



^1/4 



It is extremely important (from the viewpoint of computation) that the solution of problem (fT^ . (fO|l is 
represented analytically [3] 



(p{x) = (^(xi/a)' 



X — X\ 



X\ - Xi/2 

Thus, the solution of (O with the right-hand side pH)) at the points {2:1/4, a:i/2, 2:3/4} is as follows 

v{xi/4) = G{xi/i,s)k{s)ds, ip{xi/2)= G{xi/2,s)k{s)ds, 

, . . .2:3/4-2:1 



(14) 



(15) 



Xl - Xi/2 

Comparison of (jlip with (jlSp in their computation complexity shows evident advantage of the latter 
because it is required to compute less integrals of form Q . 

For the arbitrary function p{x) we summarize the obtained result as a theorem. 

Theorem 1. It is required to find the solution of boundary problem Q) at points with the coordinates 
{xi I xo < 2:i < 2:Ar, Xi < Xi+i, i = 1,...,N— 1}. Then the following identity takes place 



N 



jlXi — X]\[ ^ 

j=t+l 



I, Xi Xq 



a" , I = 1,...,7V- 1, 

- ^-t-r 1 - xo 



G{xt,s)p{s)ds, i = 1, iV - 1, 



G{xi-i,s)p{s)ds, z = 2, ...,iV. 



(16) 



(17) 



Let us formulate and prove the main parallel sweep theorem. 



Theorem 2. Let we have the nondegenerate system of linear algebraic equations with the tridiagonal matrix 
AK. = F of dimension n. Then for each solution vector component from the set 



= {(X)^. |1 < rij < n, Ui < rij+i, i = 1, ...,p < n} 
the following identity holds true 

P+i 



rii-l 

(^n.= E CP)j<-^,P K+i=n + l) 

j=ni-i 
■rii-l 

pi= E ("0 = 1) 



(18) 



(19) 



(20) 



where 



(21) 
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1 



1 

dk+l bk+2 Cfe+2 



an-2 bn-1 Cn-1 
O-n-l bn 



(22) 



e" = (l,0,0,...,0)^, eL = (0,..., 0,0,1) 



Proof. Let 



3=1 j=i 



Define elements of F,- as follows 



' 0, i < Uj-i 
(F)z ' "i-i <i<nj 



0, i > rij 

Then the solutions (Xj)^^_^ , (X^)^^ are unambiguously defined by the following expressions 



(23) 



(24) 
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Uj—l 



k—rij- 1 



(25) 



k—iij 



According to Lemma 1, define (X^)^^ for 1 < fc < j — 2 as 



L ^ L 



, . . . , 



(26) 



(27) 



and for j + l<k<pas 

vR _ I ^R ^R ^R ^R \ 

Denoting 

Z«^ = (i?«J-^«, ZL__^ = (i3L_^)"'eS (28) 
we obtain the general formula for computing (Xj)^ , i — I, ...,p : 

(X,)„. = " (29) 

Substituting dH]) into (Hg, yields HH). 
The theorem has been proved. 

Remarkl. Since the vectors Z^:^ do not depend on the right-hand side of ([T]), therefore, they may be 
defined once for the whole series of problems. 

Remark2. li A — , then according to Lemma 2 the quantities f3^'^ may be defined as follows 

Pt= E (F),(G„.-0,, 

(30) 

Pi- E (F),(GnJ,, 

AGk„=ek„. (31) 



Based on Theorem 2, we formulate the algorithm for computing several components from the solu- 
tion vector for the series of tridiagonal equations ^ . 



Algorithm 3. For computing M different components of the solution vector (Xn)^. , n — 1,...,N; 
m = 1, M from the series of equations H]), follow: 
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1. Preliminary step. (Performed once for the whole series of problems). 

1.1 Find Gk„, m — 1, M, from the solution of equation (|3ip . 

1.2 Find Z^_^^, m = 1, M from (EH). 

2. Step of obtaining solutions. (Performed for each right-hand side F F„, n — 1, ...,N.) 

2.1 Determine (3^^, m^l,...,M according to §Q . 

2.2 Determine (Xn)^, , m = 1, ...,M according to p^ . 

Elementary counting of arithmetic operations at the preliminary step of Algorithm 3 shows that its 
realization by formulas (|2ip and (pijl requires w 24NP operations. For calculating components ([T5|) by 
formulas (fT9 |) .(|30 p requires 12N + M"^ operations. 

An important property of Algorithm 3 is that at each of four steps of the algorithm, calculations for 
different km are independent. Therefore, the number of arithmetic operations per PE is 24NM/p for the 
first step and (12A^ + M'^)/p for the second, respectively. 

Let us analyze the efficiency of Algorithm {1,3} without regard to communication interactions. As the 
criterion, we will enter the speedup 

S = T^/Tp, 

where Ti is the number of operations for solving one problem from series ([TJ by a sequential sweep algorithm, 
and Tp , by Algorithm {1,3}. Assuming p = M, Ti — 8N, where N is the number of unknowns and 
Tp = (l2iV + 2Af2) /p, we haveS 

_ 8Np 

^-12N + p^ ^-^^^ 

From ([5^ it follows that the speedup value increases monotonically as the number of PEs grows, and 
then starting from some p > po decreases monotonically to zero. 

Evidently, the minimal time of problem solution is achieved for the number of PEs 

f SNp \ r— 

Po = max — 5- = v6iV, 



P \l2N+p- 




Thus, the efficiency of parallel Algorithm {1,3} for the maximum possible speedup is « 30%. The 
remaining 70% computations fall on "additional" operations for maintaining parallelism. From (j30p and 
([T^ it follows that the volume of these additional computations has order O(p^), where p is the number of 
PEs. 

For comparison, that difficulty is also characteristic of the algorithm proposed in 0,11 , where for 
computing of first{\J^.^ and last{\Jn.^ (in our designation) elements it is necessary to solve the tridiagonal 
system of equations with the number of unknowns equal to the number of PEs. Since the authors propose to 
calculate the solution by means of a sequential sweep algorithm version, the number of additional operations 
will be of order 0{p), but contrary to ((T^ . parallel compu ting is not allowed. 

Thus, in Algorithm {1,3} as well as in the algorithm [lO, [lO], the time of computing /irst{U„j} and 
last{\5ni} elements depends linearly on the number of PEs. 

For increasing the efficiency of Algorithm {1,3}, we will task to reduce the number of arithmetic opera- 
tions in realizing formula. 



''The obtained estimate Il32|l is conditional and represents rather the qualitative behavior of the speedup dependence on the 
number of unknowns and PEs. 
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3.4- Parallel dichotomy algorithm for solving tridiagonal SLAEs. 

It is required to calculate the components of the vector of solution defined in p^ . it is assumed that 
p = - 1 < n, po > 0. 

Let us enter into the consideration the sets 



i-l 



= {(x)„^ |(x)„^ = 2L'°s^^'J+i-*fc, ^ ^^^^^^2' _ \ [y^^^j ' (33) 

where i = 1, [log2(p)J + 1. 
It is evident that 

Llog2(p)J+l 

1=1 

Theorem 3. Let the components of the solution vector from the set flj, j > 1 and the quanities 
Pnl^, Z^^^,Gnj, i = 1, j — 1, 1 be determined. Then for all {X.)^^, E ftm, j <m< [log2(p)J+l, 
the following identity holds true 

(X)„^ = E (Z«^.)^^ + E {'^n,)^^ + 4, + 4„ (34) 

j=/ci+l ' j=i+l 

0, fcl = 



(X),^ (Z^J„^ - (Gk,+i),, (F),^ (Z«^^0„. > > 

0, fc2 = J3 + 1 



(X).,(ZLJ^^, fc2<p+l 
where fci and fc2 are defined as follows 

fcl = min (ui — nt), fc2 = niin (nt — nA 

t, t<fc, (x)„^e(J7j Uf-^o}) t, t>fc, (X)„^e(f2, U{^P+i}) 

Proof. Validity of the theorem follows from the fact that the known components from the set ftj partition 
the initial system according to Lemma 1 into independent subsystems and the solution of each subsystem 
can be represented as the sum of general solution of a homogeneous equation and a partial nonhomogeneous 
equation [21 1. 

Based on Theorem 3, we formulate the efficient parallel algorithm for computing separate components 
from the solution vector 



Algorithm 4. Dichotomy algorithm. Calculation of M different components of the solution 
vector (Xn);. , n — 1, N , m — I, M from series of equations requires: 

1. Preliminary step. (Performed once for the whole series of problems). 

1.1 Find Gk„, to = 1, ...,M, from the solution of (f?T|) . 

1.2 Find Z^,^', TO = 1, M from (^1]) . 

2. Step of obtaining solutions. (Performed for each right-hand side F„, n = 1, ...,N.) 

2.1 Find (3^;^, TO = 1, M according to ^ . 

2.2 Calculate in ascending order of index i = 1, [log2(p)J + 1 the components of the solution 
vector (X„)j;,^ ^i, using ([M]) . 



Remarks. Elements belonging to the same set Hi, can be calculated independently. 
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Let us analyze the issue of computational stability of Algorithm 4. We will say that Algorithm 4 is stable 
if for ((31 for all j 

<1, (35) 



where 



||X||^=max{|(X)J} 



Let us formulate stability criterion of Algorithm 4. 

Theorem 4. Let the matrix A have the diagonal dominance 0/ 

> |a.| + |c,|, i^2,...,N-l, (36) 

|&i|>|ci|, \hN\>\aN\, (37) 
and at least in one of inequalities or |j'7| ) , strict inequality holds, then Algorithm 4 is stable. 



Proof. If the matrix A has the diagonal dominance, then obviously the matrices B^'^ also have a diagonal 
dominance. Following the sweep algorithm the solution of system B^Z^ = e^ may be written as 



i,...,fc- 1, 



6j + aitti-i ' 



i — 2, fc, 



From conditions 
place 



ai = -ci/fli. 

(13 7p follows the inequality \ai\ < 1 [2], from where the following estimate takes 



■i+i 
Ha, 

i—k 



< 1 



We can similarly show that \zf\ < 1. Thus, the presence of diagonal dominance entails stability of Algorithm 
4. 

The theorem has been proved 

Remark 5. Since for calculating all elements from the set 17^ to perform Oij)) arithmetic operations 
regardless of the index i, for realizing Algorithm 4, computing p components from the solution vector 
requires 0{p\og2p) operations. 



Comparing the dependence of the speedup on the computing time (Fig. 2) for the dichotomy Algorithm 

we conclude that the dichotomy algorithm efhciency for few PEs is comparable 

possess 



ependen 

{1,3} and algorithm O,!!^, 

with that of Algorithms {1,3} and [i3,[23|. For a great number of PEs, Algorithms {1,3} and [idli 



a nearly zero speedup, whereas the dichotomy algorithm efficiency remains quite high. 
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Figure 2: Speedup versus the number of PEs. size{(X.)^} = 1024 

3. 5. An example of applying the dichotomy algorithm. 

For illustrating application of the dichotomy algorithm, we will consider the problem of definition 

1] = {(X),,(X)2,(X)3,,...,(X),J 
Let us define the sets fi^, i = 1, [Zog2l5j +1 = 4 according to (I33p . 
f^i = {(X)J, 

112 = {(X)4,(X)i2}, 

113 = {(X)2,(X)g,(X),o,(X)iJ, 

= {(x)i , (x)3 , (x)5 , (x)7 , (x)9 , (x)ii , (X)i3 , (X)i J . 

Then we calculate at first all elements from and then ^2, ^3, ^4 (Fig- 3). 

Step 1 0000000l0]0000000 

Step 2 ooo[olooo • ooo[olooo 

Step 3 0^0 • 0[0]0 • 0[0]0 • 0\0\0 

Step 4 [0]«[0]»[0]»|0] • [0]»|0]»|0]«[0] 

Figure 3: The order of computing the elements from the set Q. 

Thus, the initial system as a result of calculating at Step 1 the elements of fli is divided into two 
independent subproblems, into four independent subproblems at Step 2 after calculating the elements from 
fl2 , etc. until calculating the elements from il. 

3.6. Nonsymmetrical matrices 

Until now, it was supposed that the matrix of tridiagonal SLAB ([T]) symmetrical. This constraint restricts 
considerably the class of problems for which the parallel sweep algorithm is applicable. 
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Let us consider problem ([T|) with the symmetrical tridiagonal Jacobian matrix whose symmetrical el- 
ements have the same signs (aiCi_i > 0). In this case, there exists a similarity transformation with the 
diagonal matrix T such that the similar matrix A = T^^AT is symmetrical 22] ■ The elements of the 
diagonal matrix T are defined by the following recurrent relationships 



T = diag{tk}, ifc+i =tfe(afc+i/cfc)i/2, /c = l,...,iV-l 



(38) 



Thus, the series of SLAEs with the asymmetrical Jacobian matrix can be solved by the parallel sweep 
algorithm if the SLAE matrix is preliminary symmetrized via the similarity transformation. 

In the general case, when the tridiagonal matrix is not symmetrical or cannot be symmetrized, equality 
([5]) from Lamma 2 is no longer true. In this case, for determining rows of the inverse matrix, one can use 
the explicit representation of its elements IJ, |27 1 



4^' = < 



, where 



n :^ — ' 



fe=l 



C/c+1 



(39) 



(40) 



(41) 



4. Examples of applying the parallel sweep method. 

For estimating the efficiency of the parallel algorithm of solving the series of tridiagonal SLAEs 
we propose, using it as a basis, a parallel realization of the methods of solving the Poisson equa- 
tion. Let us consider the Dirichlet's problem in a rectangle with the homogeneous boundary conditions 
Go = {0 < a;^ < Z^., a = 1,2} 



Am : 



0. 



The corresponding difference approximation of second order of accuracy is 



At; = ~f{x), X G uoh-, v\ 
1 







where 



(42) 



(43) 



= {xj = (ift-i, j/12) , i = 0, ...,7Vi, j = 0, ...,iV2} 



(44) 



is a mesh with steps hi and /12, Ih is the mesh boundary. 

We will consider the variable separation method (Fourier method) [1, [S^l Alternating Direction Method 
(ADI) f5!,li,'24l| with application to solving problem (l43l) . 

a. Variable separation method. Since the function Uij vanishes if j — and j — N2, and the 
mesh function fij is given for 1 < j < A^2 ^ 1, they may be represented as a series in eigenfunctions of the 
difference operator A2 [H, [3| : 



(Aiy) = 



, 1^22; j — 



hi 



(45) 



12 



N2-1 

1=1 

N2-1 



1=1 



N2 
nlj 



0<j<A^2, 0<«<iVi, 



(46) 



A.- E /^W^i'^l]^)' 1<J<^2-1, l<*<A^i-l. 



Substituting (gB]) into (021) yields 



Ar2-1 



E ^ /ir' - 2«,(0 + u,_i(0] - 4/i2"2^,(/) sin^ + /,(/) \ sin = 



2N2 



No 



(47) 



From this, due to orthogonality of the eigenfunctions the amplitudes of harmonics of the potential 
Ui{l), I = 1, N2 — 1 can be defined as the solution of the following system of equations 



ttI 



hi 



2No 



u,+iil) - 2 + 4-i sin" — u,il) + u,_i{l) = -hif^{l), i = 1, ...,Ni - 1, 



(48) 

Uo{l) = UNiil) 0. 

The sums (jH]) should be evidently computed using the fast discrete Fourier transform 0, [13] ) and for 
finding the solutions from the series of equations we should use the sweep method. 

Let us dwell on some aspects of realizing the parallel sweep algorithm within the scope of the variable 
separation method. 

One of the constraints on the parallel sweep algorithm is that all SLAEs from the series of problems ([1]) 
contain the same fixed matrix. The tridiagonal matrices from have the form 



Bi = {T-diI), =.4j|sin2^, 1 = 1,. 



2N2 



.No 



-2 1 
1-21 



(49) 



(50) 



It is evident that for (|49p the condition of matrix constancy for all right-hand sides is not fulfilled, hence, 
in this formulation, problem (|48p cannot be solved efficiently by the proposed algorithm. However, we can 
extend Algorithm 1 for solving the series of Poisson equations on the fixed mesh 



Au = ~fn{x), n = l,...,N. 
In this case, it is required to solve the following problem 

BiUnil) ^ gn{l), l^l,...,N2-l, 71=1, ...,7V. 

The set of equations ((5^ may be considered as a set of problems of form ([T]) for the fixed 



(51) 



(52) 



b. Alternating Direction Method - belongs to the class of methods based on the concept of fixing. 
The solution of stationary problem ()42p is found as the limit i — s- cx) of solution of the following unstationary 
problem 



du 
'dt 



= Au-f 



(53) 



with the same boundary conditions. 
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Let us consider the Peaceman-Rachford scheme know also an ADI method For this purpose, 

we represent the 2D difference Laplace operator as the sum of two operators A = Ai + A2 Then the 

iterative process of the ADI method for problem has the form 



1/2 

■ = Aim"+i/2 + A2M" - /, (54) 



,.n+l/2 

-= Aiw"+i/2 + A2u"+i - /. (55) 



^(2) 
' n 

The iterative parameters rj.^^ , rj^"^ should be chosen from the condition of minimum number of iterations. 
The problem of choosing the optimal parameters is comprehensively described, e.g., in [2, 4, 26., ,6j . 

Let us consider a case when the region G is a square with the side I = li = I2 and the mesh ui is uniform 
with A^i = N2 = N . Then in order that under any initial approximation wq the norm of initial error to be 
decreased 1/e times 

\\un-u \\d< e\\U(3-U \\d 

the number of iterations n must satisfy the condition 

n > no{e) = 0.2 In (4A^/7r) In (4/e) . (56) 

Taking into account the fact that the sequence of optimal parameters t^^^ , t^^-* , fc = 1, no , is cyclic 
and the series of SLAEs ([5^ , for the fixed n includes the constant matrix 

we conclude that at the preliminary step, it is sufficient to solve merely rto tridiagonal SLAEs. It should 
be noted that the value no is much less than the total number of equations whose solutions have to be found 
for achieving the desired accuracy. 



5. Computational experiments. 

As we have already mentioned, the parallel algorithms that are efficient from the theoretical viewpoint, 
when realized on supercomputers, may not ensure the expected reduction of the computation time. The 
primary reason is that in analyzing the efhciency of a particular algorithm, it is not easy to take into 
account all peculiarities of computer systems (memory operation, network throughput and latency, etc.). 
Thus, numerical experiments with model formulations of problems are an important stage of investigating 
parallel algorithms. 

As the model problem we considered the Dirichlet problem for the Poisson equation 

Au = — Stt^ sin(27ra;) sin(27ry), x—{xi,X2)^G, -u|p = 0. (57) 
G = {0 < x„ < 1, a = 1,2} 

For solving problem (j57p in the Fortran-90 language using the MPI technology we realized Fourier and 
ADI methods. The tridiagonal matrices were inverted by a parallel dichotomy algorithm. Equation (|57p 
was approximated on uniform mesh (|44|1 with Ni = N2 = 2*^ nodes. For the ADI method, the value of 
prescribed accuracy e was 10~^. 

Figure 3a represents calculation domain decomposition for the Fourier method. Solution of the tridi- 
agonal systems of equations was performed in the direction k2, and the Fourier transform was done in the 
direction ki. For the ADI method we chose a decomposition like a lattice (Fig. 3b) because the ADI method 
requires solution of tridiagonal SLAES in the directions x and y. 

The computing time was estimated as the average time of solving one problem like (157p from a series of 
100 problems 
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T — 

avr — 



^100 rpi 



100 



and the speedup time was calculated from the formula 



Sa 



T 

^ avr 



is the time of solving problem (j57p by the parallel algorithm, and Ti - the sequential algorithm. 



Q 



P11 


P12 


Pl3 


Pl4 


P21 


P22 


P23 


P24 


P31 


P32 


P33 


P34 



a) Method of variable separation b) Method of alternating directions 

Figure 4: Domain Decomposition 



Test calculations were performed on an MBC-lOOk supercomputer of the Interdepartment Center of the 
Russian Academy of Sciences; the supercomputer is based on Intel Xcon four-core processors operating at 
3 GHz in the Infiniband communication environment. 

Results obtained for the dependence of the computing time {Tavr) and speedup {Savr) for the Fourier 
and ADI methods are listed in Tables 1 and 2, and in (Figs. 5a, 5b, 6a, 6b). 

Based on the obtained results, we will point out the following: 

• For the Fourier and ADI methods, the dependence of the computing time on the number of processors 
is linear. 

• For computing by the ADI method, starting from some number of processors, the speedup is superlinear 
because as the number of processors grows, the data volume per PE decreases, therefore, they can be 
located completely in a faster memory cache. 

• The maximum performance of the Fourier method was 1700 equations/ sec. for a 512x512 mesh 
833 eqs./sec for 1024x1024, 417 eqs./sec for 20482048, 161 eqs./sec for 4096x4096, 56 eqs./sec for 
8192x8192, and 13 eqs. for 16384x16384, respectively. 

• When the number of nodes in one direction exceeds several times the number of nodes in another 
direction, the efficiency of the parallel Fourier algorithm is between 80% and 95% (Figs. 6a and 6b). 

Thus, as a result of our computational experiments we registered the presently record efficiency in solv- 
ing the Poisson equation on a multicomputer. These results were achieved due to applying the dichotomy 
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size 


512x512 


1024x1024 
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44 


1.3O-03 
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22 


1.2C-03 


96 


2.8e-03 


175 


9.4C-03 


224 


5c- 02 


221.6 
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1.4O-01 
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().2o-0:5 


3 10 
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7.59 



Table 1: Computing time (Tavr) and speedup (Saw) versus the number of processors for the Fourier method. 
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Table 2: Computing time (To^,.) and speedup (Savr) versus the number of processors for the ADI method. (M, the number 
of processors in directions fei and ^2, which enabled the minimal computing time.) 
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a) Method of alternating directions b) Method of variable separation 

Figure 5: Speedup versus the number of processors for different meshes 



algorithm for a series of tridiagonal SLAEs, which was specially designed for distributed-memory supercom- 
puters. We should note that the proposed algorithms will be no less efficient, but even more for implementing 
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a) b) 

Figure 6: Computing time (a) and speedup (b) for the method of variable separation in the case of a rectangular region versus 
the number of processors 



on shared-memory multiprocessor computer systems because the communication interactions are minimal 
in that case. 



6. Conclusions 

The proposed parallel sweep algorithm for solving a series of tridiagonal systems of linear algebraic 
equations has validated its efficiency as a result of computational experiments. The main feature of the 
algorithm is that it is required at first to perform some preliminary computations whose complexity is 
comparable with solving one problem, and then solve a number of SLAEs for different right-hand sides with 
a nearly linear speedup. Thus, we have developed and investigated a promising method for solving a series 
of tridiagonal SLAEs, whose efficiency and scalability are record for today. 
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